
    DeepLogger: Extracting user input logs from 2D gameplay videos

    Game and player analysis would be much easier if user interactions were electronically logged and shared with game researchers. Understandably, sniffing software is perceived as invasive and a risk to privacy. To collect player analytics from large populations, we look to the millions of users who already publicly share video of their game playing. Though labor-intensive, we found that someone with experience of playing a specific game can watch a screencast of someone else playing, and can then infer approximately what buttons and controls the player pressed, and when. We seek to automatically convert video into such game-play transcripts, or logs. We approach the task of inferring user interaction logs from video as a machine learning challenge. Specifically, we propose a supervised learning framework to first train a neural network on videos where real sniffer/instrumented software was collecting ground-truth logs. Once our DeepLogger network is trained, it should ideally infer log-activities for each new input video featuring gameplay of that game. These user-interaction logs can serve as sensor data for gaming analytics, or as supervision for training game-playing AIs. We evaluate the DeepLogger system for generating logs from two 2D games, Tetris [23] and Mega Man X [6], chosen to represent distinct game genres. Our system performs as well as human experts for the task of video-to-log transcription, and could allow game researchers to easily scale their data collection and analysis up to massive populations.
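
    A minimal sketch of the kind of supervised video-to-log mapping described above, assuming frames are stacked channel-wise and each controller button is predicted as an independent binary label; the class and variable names are hypothetical, not the authors' code.

```python
# Minimal sketch (not the authors' code): a frame-window classifier that maps
# short clips of gameplay video to per-frame button states, trained with
# binary cross-entropy against logs captured by instrumented software.
import torch
import torch.nn as nn

class FrameToLogNet(nn.Module):
    """Hypothetical DeepLogger-style network: stack of frames -> button logits."""
    def __init__(self, n_frames=4, n_buttons=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3 * n_frames, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_buttons)   # one logit per controller button

    def forward(self, clip):                   # clip: (B, 3*n_frames, H, W)
        x = self.features(clip).flatten(1)
        return self.head(x)                    # (B, n_buttons) logits

# One training step against ground-truth logs (multi-label: several buttons at once).
model = FrameToLogNet()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

clip = torch.randn(16, 12, 96, 96)              # toy batch of stacked frames
buttons = torch.randint(0, 2, (16, 8)).float()  # toy sniffer logs (pressed / not)
optimizer.zero_grad()
loss = criterion(model(clip), buttons)
loss.backward()
optimizer.step()
```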

    Learning Dilation Factors for Semantic Segmentation of Street Scenes

    Contextual information is crucial for semantic segmentation. However, finding the optimal trade-off between keeping desired fine details and at the same time providing sufficiently large receptive fields is non-trivial. This is even more so when objects or classes present in an image vary significantly in size. Dilated convolutions have proven valuable for semantic segmentation, because they allow increasing the size of the receptive field without sacrificing image resolution. However, in current state-of-the-art methods, dilation parameters are hand-tuned and fixed. In this paper, we present an approach for learning dilation parameters adaptively per channel, consistently improving semantic segmentation results on street-scene datasets like Cityscapes and CamVid. (GCPR 2017)
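
    The mechanism at stake can be illustrated with a rough sketch: the paper learns the dilation factors per channel, whereas the toy block below simply assigns a different fixed dilation to each channel group, to show how dilation grows the receptive field without reducing resolution (all names are made up).

```python
# Illustrative sketch only: the paper learns a real-valued dilation per channel;
# here the rates are fixed per channel group just to show how dilation widens the
# receptive field (effective kernel size = k + (k-1)*(d-1)) at full resolution.
import torch
import torch.nn as nn

class MultiDilationBlock(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        assert channels % len(dilations) == 0
        group = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, group, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):
        # Each channel group sees a different context size; spatial size is unchanged.
        return torch.cat([b(x) for b in self.branches], dim=1)

feat = torch.randn(1, 64, 128, 256)            # e.g. a street-scene feature map
out = MultiDilationBlock()(feat)
print(out.shape)                               # torch.Size([1, 64, 128, 256])
```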

    Virtual Adversarial Ladder Networks For Semi-supervised Learning

    Semi-supervised learning (SSL) partially circumvents the high cost of labeling data by augmenting a small labeled dataset with a large and relatively cheap unlabeled dataset drawn from the same distribution. This paper offers a novel interpretation of two deep learning-based SSL approaches, ladder networks and virtual adversarial training (VAT), as applying distributional smoothing to their respective latent spaces. We propose a class of models that fuse these approaches. We achieve near-supervised accuracy with high consistency on the MNIST dataset using just 5 labels per class: our best model, ladder with layer-wise virtual adversarial noise (LVAN-LW), achieves a 1.42% ± 0.12 average error rate on the MNIST test set, compared with 1.62% ± 0.65 reported for the ladder network. On adversarial examples generated with the L2-normalized fast gradient method, LVAN-LW trained with 5 examples per class achieves an average error rate of 2.4% ± 0.3, compared to 68.6% ± 6.5 for the ladder network and 9.9% ± 7.5 for VAT.
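
    A minimal sketch of the virtual adversarial smoothing term that VAT-style models add to the loss, assuming a standard softmax classifier; this is not the paper's LVAN-LW architecture, only the perturbation idea it builds on.

```python
# Minimal VAT sketch: find an L2-normalized perturbation that most changes the
# model's predictive distribution, then penalize that change (no labels needed).
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=2.0):
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)                # current predictions

    # One power-iteration step to approximate the most sensitive direction.
    d = torch.randn_like(x)
    d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]

    # Virtual adversarial perturbation and the resulting smoothness penalty.
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(x)
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p, reduction="batchmean")

# Usage: total_loss = supervised_loss + alpha * vat_loss(model, unlabeled_batch)
```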

    Help, It Looks Confusing: GUI Task Automation Through Demonstration and Follow-up Questions

    Non-programming users should be able to create their own customized scripts to perform computer-based tasks for them, just by demonstrating to the machine how it's done. To that end, we develop a learning-by-demonstration system prototype called HILC (Help, It Looks Confusing). Users train HILC to synthesize a task script by demonstrating the task, which produces the needed screenshots and their corresponding mouse-keyboard signals. After the demonstration, the user answers follow-up questions. We propose a user-in-the-loop framework that learns to generate scripts of actions performed on visible elements of graphical applications. While pure programming-by-demonstration is still unrealistic, we use quantitative and qualitative experiments to show that non-programming users are willing and able to answer the follow-up queries posed by our system. Our models of events and appearance are surprisingly simple, but are combined effectively to cope with varying amounts of supervision. The best available baseline, Sikuli Slides, struggled with the majority of the tests in our user study experiments. The prototype with our proposed approach successfully helped users accomplish simple linear tasks, complicated tasks (monitoring, looping, and mixed), and tasks that span multiple executables. Even when both systems could ultimately perform a task, ours was trained and refined by the user in less time.
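
    One plausible ingredient of such a system, sketched below purely for illustration: locating a GUI element captured during the demonstration inside a fresh screenshot with OpenCV template matching. The file names and threshold are hypothetical, and this is not HILC's implementation.

```python
# Sketch of one ingredient such a system needs (not HILC's actual code): find a
# GUI element, cropped from a demonstration screenshot, in a new screenshot via
# normalized cross-correlation template matching.
import cv2

def find_element(screenshot_bgr, template_bgr, threshold=0.9):
    """Return (x, y) of the best match for the template, or None if below threshold."""
    result = cv2.matchTemplate(screenshot_bgr, template_bgr, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < threshold:
        return None                       # element not visible -> ask a follow-up question
    h, w = template_bgr.shape[:2]
    return (max_loc[0] + w // 2, max_loc[1] + h // 2)   # centre, e.g. where to click

screenshot = cv2.imread("screen.png")      # hypothetical file names
button = cv2.imread("ok_button.png")
print(find_element(screenshot, button))
```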

    RecurBot: Learn to auto-complete GUI tasks from human demonstrations

    On the surface, task completion should be easy in graphical user interface (GUI) settings. In practice, however, different actions look alike and applications run in operating-system silos. Our aim within GUI action recognition and prediction is to help the user, at least in completing the tedious tasks that are largely repetitive. We propose a method that learns from a few user-performed demonstrations, and then predicts and finally performs the remaining actions in the task. For example, a user can send customized SMS messages to the first three contacts in a school’s spreadsheet of parents; our system then loops the process, iterating through the remaining parents. First, our analysis system segments the demonstration into discrete loops, where each iteration usually includes both intentional and accidental variations. Our key technical innovation is a solution to this standing motif-finding optimization problem, and we also find visual patterns in those intentional variations. The second challenge is to predict subsequent GUI actions, extrapolating the patterns to allow our system to predict and perform the rest of a task. We validate our approach on a new database of GUI tasks, and show that our system usually (a) gleans what it needs from short user demonstrations, and (b) auto-completes tasks in diverse GUI situations.
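
    A toy sketch of the loop-detection idea, far simpler than the motif-finding optimization the paper actually solves: estimate the period of a tokenized action sequence by checking how well it repeats at each candidate loop length (the action names are invented).

```python
# Toy sketch (far simpler than the paper's motif-finding formulation): estimate
# the loop length of a demonstrated action sequence by scoring how well the
# token stream repeats at each candidate period.
def estimate_period(actions, min_period=2):
    """actions: list of hashable action tokens, e.g. ('click', cell) or ('type', text)."""
    best_period, best_score = None, -1.0
    for p in range(min_period, len(actions) // 2 + 1):
        matches = sum(actions[i] == actions[i + p] for i in range(len(actions) - p))
        score = matches / (len(actions) - p)
        if score > best_score:
            best_period, best_score = p, score
    return best_period, best_score

demo = ["select_row", "copy_phone", "compose_sms", "send"] * 3 + ["select_row"]
print(estimate_period(demo))   # -> (4, 1.0): one loop spans four actions
```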

    Harmonic Networks: Deep Translation and Rotation Equivariance

    Translating or rotating an input image should not affect the results of many computer vision tasks. Convolutional neural networks (CNNs) are already translation equivariant: input image translations produce proportionate feature map translations. This is not the case for rotations. Global rotation equivariance is typically sought through data augmentation, but patch-wise equivariance is more difficult. We present Harmonic Networks, or H-Nets, a CNN exhibiting equivariance to patch-wise translation and 360° rotation. We achieve this by replacing regular CNN filters with circular harmonics, returning a maximal response and orientation for every receptive field patch. H-Nets use a rich, parameter-efficient representation with low computational complexity, and we show that deep feature maps within the network encode complicated rotational invariants. We demonstrate that our layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization. We also achieve state-of-the-art classification on rotated-MNIST, and competitive results on other benchmark challenges.
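
    A small sketch of the filter family the network is built from, assuming a Gaussian ring as the radial profile (which the paper would learn); it only constructs one circular harmonic, not a full H-Net layer.

```python
# Sketch of the filter family H-Nets build on (not the full network): a circular
# harmonic W_m(r, phi) = R(r) * exp(i * (m * phi + beta)) sampled on a k x k grid,
# with a Gaussian ring standing in for the (normally learned) radial profile R(r).
import numpy as np

def circular_harmonic(k=7, m=1, beta=0.0, ring_radius=2.0, sigma=1.0):
    c = (k - 1) / 2.0
    y, x = np.mgrid[0:k, 0:k] - c
    r = np.hypot(x, y)
    phi = np.arctan2(y, x)
    radial = np.exp(-((r - ring_radius) ** 2) / (2 * sigma ** 2))   # stand-in for R(r)
    return radial * np.exp(1j * (m * phi + beta))                   # complex k x k filter

w = circular_harmonic()
# Rotating the input by theta rotates this filter's response phase by m * theta,
# which is what gives the patch-wise rotation equivariance described above.
print(w.shape, w.dtype)        # (7, 7) complex128
```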

    Structured prediction of unobserved voxels from a single depth image

    Building a complete 3D model of a scene, given only a single depth image, is underconstrained. To gain a full volumetric model, one needs either multiple views, or a single view together with a library of unambiguous 3D models that will fit the shape of each individual object in the scene. We hypothesize that objects of dissimilar semantic classes often share similar 3D shape components, enabling a limited dataset to model the shape of a wide range of objects, and hence estimate their hidden geometry. Exploring this hypothesis, we propose an algorithm that can complete the unobserved geometry of tabletop-sized objects, based on a supervised model trained on already available volumetric elements. Our model maps from a local observation in a single depth image to an estimate of the surface shape in the surrounding neighborhood. We validate our approach both qualitatively and quantitatively on a range of indoor object collections and challenging real scenes.
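
    A minimal sketch of the prediction target, assuming a plain fully connected regressor from a local depth patch to voxel occupancy logits; the paper's structured model is more sophisticated, and the sizes here are invented.

```python
# Minimal sketch of the prediction target only (not the paper's structured model):
# a small network mapping a local depth patch to occupancy probabilities for the
# voxel neighbourhood behind it, trained against complete ground-truth volumes.
import torch
import torch.nn as nn

class PatchToVoxels(nn.Module):
    def __init__(self, patch=16, vox=8):
        super().__init__()
        self.vox = vox
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(patch * patch, 256), nn.ReLU(),
            nn.Linear(256, vox ** 3),        # one occupancy logit per voxel
        )

    def forward(self, depth_patch):          # (B, 1, patch, patch)
        logits = self.net(depth_patch)
        return logits.view(-1, self.vox, self.vox, self.vox)

pred = PatchToVoxels()(torch.randn(4, 1, 16, 16))
print(pred.shape)                             # torch.Size([4, 8, 8, 8])
```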

    Patch based synthesis for single depth image super-resolution

    We present an algorithm to synthetically increase the resolution of a solitary depth image using only a generic database of local patches. Modern range sensors measure depths with non-Gaussian noise and at lower starting resolutions than typical visible-light cameras. While patch-based approaches for upsampling intensity images continue to improve, this is the first exploration of patching for depth images. We match against the height field of each low-resolution input depth patch, and search our database for a list of appropriate high-resolution candidate patches. Selecting the right candidate at each location in the depth image is then posed as a Markov random field labeling problem. Our experiments also show how further depth-specific processing, such as noise removal and correct patch normalization, dramatically improves our results. Perhaps surprisingly, even better results are achieved on a variety of real test scenes by providing our algorithm with only synthetic training depth data.
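
    A sketch of the candidate-retrieval step under the assumption of mean-normalized patches and a KD-tree lookup; the subsequent Markov random field labeling over the retrieved candidates is omitted, and the database here is random placeholder data.

```python
# Sketch of the candidate-retrieval step only (the MRF labelling over candidates
# is omitted): normalize low-resolution depth patches and look up nearest
# neighbours in a database of (low-res, high-res) training patch pairs.
import numpy as np
from scipy.spatial import cKDTree

def normalize(patch):
    # Depth patches are matched as height fields: remove the mean depth so that
    # patches at different absolute distances can still match.
    flat = patch.reshape(-1).astype(np.float64)
    return flat - flat.mean()

# Placeholder database of training pairs.
lowres_db = np.random.rand(10000, 4, 4)        # 4x4 low-res depth patches
highres_db = np.random.rand(10000, 16, 16)     # corresponding 16x16 patches
tree = cKDTree(np.stack([normalize(p) for p in lowres_db]))

query = np.random.rand(4, 4)                   # one patch from the input depth image
_, idx = tree.query(normalize(query), k=5)     # k candidate labels for the MRF
candidates = highres_db[idx]                   # 5 candidate high-res patches
print(candidates.shape)                        # (5, 16, 16)
```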

    ICNet for Real-Time Semantic Segmentation on High-Resolution Images

    We focus on the challenging task of real-time semantic segmentation in this paper. It has many practical applications, yet poses the fundamental difficulty of cutting a large portion of the computation required for pixel-wise label inference. We propose an image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide an in-depth analysis of our framework and introduce the cascade feature fusion unit to quickly achieve high-quality segmentation. Our system yields real-time inference on a single GPU card with decent-quality results evaluated on challenging datasets like Cityscapes, CamVid and COCO-Stuff. (ECCV 2018)
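
    A sketch of a cascade-feature-fusion-style unit following the description above (upsample the coarse branch, enlarge its receptive field with a dilated convolution, project the fine branch, sum and apply a ReLU); the channel sizes and layer choices are assumptions, not the released ICNet code.

```python
# Sketch of a cascade-feature-fusion-style unit (channel sizes are arbitrary):
# fuse a coarse, low-resolution feature map with a finer, higher-resolution one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, kernel_size=3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.high = nn.Sequential(
            nn.Conv2d(high_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x_low, x_high):
        # Upsample the coarse branch to the fine branch's resolution, then fuse.
        x_low = F.interpolate(x_low, size=x_high.shape[2:], mode="bilinear",
                              align_corners=False)
        return F.relu(self.low(x_low) + self.high(x_high))

fused = CascadeFeatureFusion(256, 64, 128)(torch.randn(1, 256, 32, 64),
                                           torch.randn(1, 64, 64, 128))
print(fused.shape)                            # torch.Size([1, 128, 64, 128])
```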

    Predicting the Perceptual Demands of Urban Driving with Video Regression

    Driving safely requires perceiving vast amounts of rapidly changing visual information. This can exhaust our limited perceptual capacity and lead to cases of 'looking but failing to see', reportedly the third largest contributing factor to road traffic accidents. In the present work we use a 3D convolutional neural network to model the perceptual demand of varied driving situations. To validate the method we introduce a new labelled dataset of approximately 2300 videos of driving in Brussels and California.
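
    A minimal sketch of a 3D convolutional regressor of the kind described, mapping a short clip to a single perceptual-demand score; the architecture and sizes are invented for illustration, not the paper's model.

```python
# Minimal sketch (not the paper's architecture): a small 3D CNN that regresses a
# single scalar perceptual-demand score from a short clip of driving video.
import torch
import torch.nn as nn

class DemandRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, 1)           # scalar demand rating

    def forward(self, clip):                   # clip: (B, 3, T, H, W)
        return self.head(self.features(clip).flatten(1))

clip = torch.randn(2, 3, 16, 112, 112)         # toy batch of 16-frame clips
print(DemandRegressor()(clip).shape)           # torch.Size([2, 1])
# Trained with an MSE loss against human demand labels for each video.
```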